White Wine Quality Analysis
In the following sections we analyse the white wine dataset.
We load the dataset into R and inspect its dimensions, variable names, structure and summary.
## [1] "Dataset Length : "
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Looking at the dataset summary above, we can see that, other than the row index ‘X’ and the ‘quality’ variable, we have 11 variables, each with 4898 observations.
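The chunk that produced the output above is not shown in this rendering. A minimal sketch of what it might look like follows. The file name `wineQualityWhites.csv` and the helper name `inspect_dataset` are assumptions, and a tiny stand-in data frame (three rows copied from the structure output above) is used when the file is not present, so the printed values will only match the real output if the real file is loaded.

```r
# Hypothetical helper: print the four views shown above
# (dimensions, column names, structure, summary) for any data frame.
inspect_dataset <- function(df) {
  print("Dataset dimensions (rows, columns):")
  print(dim(df))
  print(names(df))
  str(df)
  print(summary(df))
  invisible(dim(df))
}

# Assumed file name; replace with the actual path to the white wine CSV.
path <- "wineQualityWhites.csv"
if (file.exists(path)) {
  ds <- read.csv(path)
} else {
  # Stand-in rows copied from the str() output above, for illustration only.
  ds <- data.frame(
    X = 1:3,
    fixed.acidity = c(7, 6.3, 8.1),
    alcohol = c(8.8, 9.5, 10.1),
    quality = c(6L, 6L, 6L)
  )
}
dims <- inspect_dataset(ds)
```

Note that `dim()` returns both the row and the column count, which is why the output labelled "Dataset Length" above shows two numbers.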
Univariate Analysis:
In this section we will perform univariate analysis of each variable in the white wine dataset.
1. Fixed Acidity

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Little can be inferred from this alone, but most samples have a fixed acidity between about 6 and 7 (the interquartile range is 6.3 to 7.3), and the summary shows a maximum of 14.2.
2. Volatile Acidity
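The plotting code for the univariate charts is not shown in this rendering. As a rough sketch of how a summary and a histogram for one variable could be produced: the ten values below are the first ten fixed acidity readings from the structure output above, used as a stand-in for `ds$fixed.acidity`; the bin width of 0.5 and the axis labels are assumptions. The plot is drawn only if ggplot2 is installed.

```r
# Stand-in values for illustration; in the notebook this would be ds$fixed.acidity.
fixed_acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)

# Six-number summary, the same view printed for every variable in this section.
acidity_summary <- summary(fixed_acidity)
print(acidity_summary)

# Draw the histogram only if ggplot2 is available (bin width is an assumption).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(fixed.acidity = fixed_acidity),
                       ggplot2::aes(x = fixed.acidity)) +
    ggplot2::geom_histogram(binwidth = 0.5) +
    ggplot2::labs(x = "Fixed acidity", y = "Number of samples")
  print(p)
}
```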

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
This distribution is also approximately normal, with its peak around 0.3 (the median is 0.26).
3. Citric Acid

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
4. Residual Sugar

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The distribution is strongly right-skewed and can be approximated by an exponential decay along the positive axis. About half of the wine samples have a low amount of residual sugar (the median is 5.2 against a maximum of 65.8).
5. Chlorides

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The curve is quite narrow, which means that the chloride content is more or less similar in most of the samples. Very few of them have a much higher chloride content, and these can be regarded as outliers. The summary statistics confirm this: the third quartile is 0.05, while the maximum is 0.346.
5. Free Sulphur Dioxide
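The outlier remark can be made concrete with a little arithmetic on the quartiles printed above. The sketch below applies the conventional 1.5 × IQR rule (Tukey's fences); that rule is an assumption of this example, not a criterion stated in the analysis.

```r
# Quartiles and maximum of chlorides, copied from the summary above.
q1 <- 0.036
q3 <- 0.050
max_value <- 0.346

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged as outliers
print(upper_fence)

# Is the largest observation beyond the fence?
max_is_outlier <- max_value > upper_fence
print(max_is_outlier)
```

With the full data loaded, the number of flagged samples could be counted with `sum(ds$chlorides > upper_fence)`.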

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
This has a much wider distribution, which means that the amount of free sulfur dioxide varies considerably across the samples. It may be significantly related to the sample quality.
6. Total Sulphur Dioxide

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Just like free sulfur dioxide, the total sulfur dioxide distribution is also quite widely spread across the samples, with a peak around 120 units. It may also be related to the quality of the wine.
7. Density

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Unlike the other features discussed above, the density distribution has very few outliers. This is understandable, since the density of a sample cannot rise or fall significantly through error or exception.
8. pH

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH depends on the concentration of the other ingredients, which together determine the overall acidity of the sample. As those ingredients vary, so does pH, hence its wide spread. A good quality wine is expected to have a higher pH (that is, to be less acidic).
9. Sulphates

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The distribution peaks around 0.5, with a number of high outliers (the maximum is 1.08).
10. Alcohol

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The alcohol content ranges from 8% to 14.2%, with most samples between about 9.5% and 11.4% (the interquartile range). As a general rule, the higher the alcohol content, the better the quality of the wine. It is one of the most important features in deciding the quality of a sample.
11. Quality

##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
It can be seen that the sample distribution is highly non-uniform, with most samples of middling quality. Both very high and very low quality wines are poorly represented: each of the scores 3, 4, 8 and 9 accounts for less than 5% of the samples. The sample is therefore not well suited to drawing firm conclusions about the extremes, because with so few samples the features cannot be compared reliably across all quality levels. Ideally, such an analysis would use a sample with better coverage of the extreme quality levels.
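The imbalance can be quantified directly from the counts in the table above. The following sketch uses only those counts; the variable names are illustrative.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of each score as a percentage of all samples, rounded to two decimals.
quality_share <- round(100 * quality_counts / sum(quality_counts), 2)
print(quality_share)

# Combined share of the extreme scores 3, 4, 8 and 9.
extreme_share <- sum(quality_share[c("3", "4", "8", "9")])
print(extreme_share)
```

With the data frame loaded, the same shares would come from `prop.table(table(ds$quality))`.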
Univariate Analysis: Results
What is the structure of the dataset?
There are 4898 samples in the dataset with 12 features (Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulphur Dioxide, Total Sulphur Dioxide, Density, pH, Sulphates, Alcohol and Quality). The variable quality is treated as an ordered factor with the following levels.
Worst ----> Best
3 4 5 6 7 8 9
What are the important features of interest in the dataset?
The most important feature of interest in the dataset is the ‘quality’ parameter. The other features that look relevant are ‘alcohol’, ‘density’, ‘chlorides’ and ‘sulphur dioxide’, along with their impact on the quality of the sample.
Any other inferences?
Since the number of samples with quality 3, 4, 8 and 9 is very low, especially for 9 (the best quality), comparisons among these levels cannot be properly justified. For a sound comparison and sound conclusions, an adequate number of samples is required at each level of a category.
Bivariate Plots Section
Up to now we have only one ordered factor variable, ‘quality’. Having more than one factor variable greatly helps to get more insight in bivariate and multivariate analysis. In this sample almost all of the variables appear to contribute to quality independently of one another, but ‘density’ is another parameter that can have a relationship with the other parameters. So I convert density into an ordered factor by splitting it into ranges.
Creating new variable ‘density_factor’
summary(ds$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
ds$density_factor <- cut(ds$density, breaks = c(0.986, 0.990, 0.994, 0.998, 1.04))
summary(ds$density_factor)
## (0.986,0.99] (0.99,0.994] (0.994,0.998] (0.998,1.04]
## 373 2244 1746 535
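The text above describes density_factor as an ordered factor, but `cut()` returns an unordered factor unless `ordered_result = TRUE` is passed. A small sketch of the difference follows, using five density values taken from the summary above (minimum, quartiles and maximum) rather than the full column.

```r
# A few density values taken from the summary above (min, quartiles, max).
density <- c(0.9871, 0.9917, 0.9937, 0.9961, 1.0390)
breaks <- c(0.986, 0.990, 0.994, 0.998, 1.04)

# ordered_result = TRUE makes the bins an ordered factor, so bins can be compared.
density_factor <- cut(density, breaks = breaks, ordered_result = TRUE)
print(density_factor)
print(is.ordered(density_factor))

# Comparison operators only work on ordered factors.
lowest_below_highest <- density_factor[1] < density_factor[5]
print(lowest_below_highest)
```

Whether the ordering matters depends on how the factor is used later; for colouring or faceting plots an unordered factor behaves the same, since the levels are already in ascending order.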
Initiating Bivariate Analysis
library(GGally)  # provides ggpairs()
ggpairs(ds[sample.int(nrow(ds), 1000), 2:14])  # 1000 random rows, all columns except X

From the above it can be inferred that most variables have little correlation with one another, except with quality and, to some extent, with density. Beyond that, some correlation is expected between sugar and density, and between sugar and sulphur dioxide.
Generating the Correlation Matrix:
The correlation matrix gives a rough picture of how each variable relates to the others, which helps in choosing the right variables for further analysis.

It can be seen that the ‘quality’ of wine depends strongly on alcohol content and density, and also on the amount of chlorides, sulphur dioxide and acidity of the sample. Hence these are important variables that can be picked for analysis.
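The chunk that built the correlation matrix is not shown in this rendering. One plausible way to compute it is `cor()` on the numeric columns, sketched below on a five-row stand-in built from the first values in the structure output above. With so few rows the coefficients will not match those of the full dataset; the point is only the shape of the call.

```r
# Stand-in data: first five rows of three columns, copied from the str() output above.
toy <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation between every pair of columns, rounded for readability.
cor_matrix <- round(cor(toy), 2)
print(cor_matrix)
```

On the real data the equivalent call would be along the lines of `round(cor(ds[, 2:13]), 2)`, assuming columns 2 to 13 are the numeric features and quality.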
1. Residual Sugar vs Sulphur Dioxide
Residual sugar and sulfur dioxide showed a comparatively high correlation with each other, so it is worth plotting them to examine the relationship graphically.

##
## Pearson's product-moment correlation
##
## data: ds$total.sulfur.dioxide and ds$residual.sugar
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3776791 0.4246712
## sample estimates:
## cor
## 0.4014393
The correlation of 0.4 shows a moderate relationship between total sulphur dioxide and residual sugar. This may be because total sulphur dioxide has two parts, free and bound. The bound part is also produced during fermentation, and some of it binds to the sugar.
Plotting the approximate relationship
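The t statistic in the test output above follows from the correlation and the degrees of freedom alone, via t = r·sqrt(df) / sqrt(1 − r²). The sketch below reproduces it from the two reported numbers, and also computes r², the share of variance that a linear relationship would explain.

```r
r  <- 0.4014393   # sample correlation reported above
df <- 4896        # degrees of freedom, n - 2 with n = 4898

# t statistic for the null hypothesis that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 3))

# Share of variance in one variable linearly explained by the other.
r_squared <- r^2
print(round(r_squared, 3))
```

Because the sample is large, even a moderate correlation produces a very large t value and a tiny p-value, so the p-value says little about the strength of the relationship; r² is the more informative figure here.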

2. Residual Sugar vs Density

##
## Pearson's product-moment correlation
##
## data: ds$density and ds$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
The high correlation of 0.84 arises because density rises with sugar concentration: the more sugar, the heavier the sample. The smoothed curve below depicts exactly this relationship.

3. Alcohol vs Density
Similarly, alcohol is lighter than water, so a higher alcohol content should correspond to a lower density. This is exactly what the graph below shows.

##
## Pearson's product-moment correlation
##
## data: ds$density and ds$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
4. Sulphur Dioxide vs Density

##
## Pearson's product-moment correlation
##
## data: ds$total.sulfur.dioxide and ds$density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5094349 0.5497297
## sample estimates:
## cor
## 0.5298813
5. Sulphate vs Density

##
## Pearson's product-moment correlation
##
## data: ds$density and ds$sulphates
## t = 5.2269, df = 4896, p-value = 1.795e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04658388 0.10228621
## sample estimates:
## cor
## 0.07449315
Sulphates have virtually no relationship with density (the correlation is only 0.07).
6. Chlorides vs Density

##
## Pearson's product-moment correlation
##
## data: ds$density and ds$chlorides
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2308679 0.2831779
## sample estimates:
## cor
## 0.2572113
Plotting relationships with the Quality variable
This section helps in evaluating how each variable contributes to the quality of the wine.
1. Fixed Acidity vs Quality
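The plotting code for this section is not shown in this rendering. The discussion further on refers to the mean alcohol content per quality index, so one plausible building block is a per-quality average, sketched below with `aggregate()`. The toy data frame is entirely hypothetical and only illustrates the call.

```r
# Hypothetical toy data; in the notebook this would be the full data frame ds.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.0, 9.4, 10.0, 10.6, 11.2, 11.8)
)

# Mean alcohol content for each quality score.
mean_by_quality <- aggregate(alcohol ~ quality, data = toy, FUN = mean)
print(mean_by_quality)
```

The resulting two-column data frame can be passed straight to a line or bar plot with quality on the x axis.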

2. Volatile Acidity vs Quality

Volatile acidity shows minimal variation with the quality of the sample. It fluctuates strongly and shows no trend.
3. Citric Acid vs Quality

It looks as if higher quality wine has a slightly higher amount of citric acid than the other samples. This amount must not be too high, though, as it would then reduce the pH of the wine.
4. Residual Sugar vs Quality

5. Chlorides vs Quality
Chlorides look strongly related to wine quality. The trend shows that as the quality of the wine increases, the chloride concentration decreases.
6. Sulfur Dioxide vs Quality

7. Density vs Quality

8. pH vs Quality
It can be seen that good quality wine has a higher pH value, a significantly lower density (because of high alcohol, low SO2 and low sugar, as discussed in the previous section) and slightly reduced sulphur dioxide. The SO2 content in the poorest quality samples is quite high.
9. Sulphates vs Quality

The sulphate concentration appears to have a negligible effect on the quality of the wine sample. It was seen earlier that its relationship with density is also insignificant.
10. Alcohol vs Quality

Alcohol, as expected, has the highest correlation with the quality of the sample. The best quality samples have the highest mean alcohol content. There is one strange observation, however: quality indexes 3 and 4 have a higher mean alcohol content than quality index 5. I suspect that, despite a good alcohol content, these samples had other factors (for example high chlorides, low pH or high SO2) that contributed to the lower quality index.
Bivariate Plot Analysis: Conclusions
How did the feature(s) of interest vary with other features in the dataset?
This section was divided into two parts. In the first part, the important features were compared with other features in the dataset. In the second part, all of the features were analysed against the quality of the wine samples, to study how they varied with increasing quality.
Firstly, we found that residual sugar was strongly related to density (r = 0.84) and moderately related to total sulfur dioxide (r = 0.40). Higher sulphur dioxide and higher chlorides also corresponded to higher density. Features like sulphates and fixed acidity have the least relationship with any other feature, and even with quality.
What was the strongest relationship you found?
The strongest relationships found were between residual sugar and density (r = 0.84) and between alcohol content and density (r = -0.78).
Which feature has the least relationship with the others?
Sulphates have the least relationship with any other feature in the dataset.
Multivariate Plots Section
In the next section, I analyse sample features such as chlorides, SO2, pH and alcohol in terms of the quality and density of the sample. These two variables are chosen because they depend on the other variables more than any of the rest do; all other variables are largely independent of each other. Density is expressed through the newly created factor variable density_factor, in which I have categorised density into ranges.
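The plotting code for the charts that follow is not shown in this rendering. A plausible intermediate step is a table of means grouped by both quality and density_factor, sketched below. The toy rows are hypothetical; the break points are the ones used earlier when density_factor was created.

```r
# Hypothetical toy data; in the notebook this would be the full data frame ds.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 5, 6),
  density = c(0.9920, 0.9930, 0.9925, 0.9965, 0.9970, 0.9975),
  alcohol = c(10.8, 11.0, 11.6, 9.8, 9.4, 9.6)
)

# Same density ranges as used for density_factor earlier in the analysis.
toy$density_factor <- cut(toy$density, c(0.986, 0.990, 0.994, 0.998, 1.04))

# Mean alcohol for every observed (quality, density range) combination.
mean_alcohol <- aggregate(alcohol ~ quality + density_factor,
                          data = toy, FUN = mean)
print(mean_alcohol)
```

Plotting one line per density range from such a table gives a chart of mean alcohol against quality at different densities.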
1. Mean Alcohol vs quality @Different Density

2. Mean Chlorides vs quality @Different Density

Alcohol and chlorides both correlate strongly with the quality of the sample, but in opposite directions. Alcohol has a strong positive correlation, while chlorides have a negative correlation, as depicted in the graphs. Moreover, alcohol has a strong relationship with density, with higher density samples having lower alcohol and vice versa. Chlorides, however, have very little relationship with density.
3. Mean Sulphur Dioxide vs quality @Different Density

4. Mean pH vs quality @Different Density

pH has a negligible correlation with the density of the sample, but it does have a good relationship with quality. Higher quality wine is expected to be less acidic, that is, to have a higher pH, which is broadly reflected in the plot.
5. Mean Citric Acid vs quality @Different Density

A good quality wine calls for a somewhat higher amount of citric acid, which gives the wine a less sweet, slightly sour taste. It is also seen that citric acid correlates well with density, with low amounts in low density samples and vice versa.
6. Mean Residual Sugar vs quality @Different Density

The sugar level cannot be high in a good quality wine, which is exactly what the plot shows. Lower density samples also have a lower amount of sugar.
7. Mean Fixed Acidity vs quality @Different Density

Very little correlation is seen, though the acidity level decreases slightly with increasing quality.
8. Mean Sulphates vs quality @Different Density

Overall, sulphates have a minimal effect on quality, though some higher density, high quality samples have a high sulphate value.
8. ScatterPlot- Alcohol vs quality @Different Density
Plotting chlorides and alcohol levels at different levels of quality

This plot is quite jagged and almost unreadable. Hence I rearrange the quality factor into broader buckets and also add a smoothed line.
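The re-bucketing code is not shown in this rendering. A minimal sketch using `cut()` follows; the cut points (3-4 as low, 5-6 as medium, 7-9 as high) and the labels are assumptions made for illustration. The quality vector is rebuilt from the counts in the quality table earlier in the document.

```r
# Rebuild the quality column from the counts reported in the quality table.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Assumed buckets: (2,4] = low, (4,6] = medium, (6,9] = high.
quality_group <- cut(quality, breaks = c(2, 4, 6, 9),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

# Number of samples in each bucket.
group_counts <- table(quality_group)
print(group_counts)
```

Under these assumed cut points the medium bucket still holds most of the samples, so the imbalance noted in the univariate section is reduced but not removed.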

9. ScatterPlot- SO2 vs quality @Different Density

Creating Normalised DataSet
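# Min-max scaling: map a numeric vector onto the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Scale columns 2 to 12 (the eleven physicochemical features)
dsNorm <- as.data.frame(lapply(ds[, 2:12], normalize))
# Carry over the two grouping variables unscaled
dsNorm$quality <- ds$quality
dsNorm$density_factor <- ds$density_factor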
Creating a normalised dataset with values in [0, 1] lets us plot all (or most) of the features in a single chart. In the following section I have created two charts: one shows the mean variation of the features that do not relate to quality, and the other shows the dominant features, that is, those influencing quality to a small or large degree.
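As a quick check of what min-max scaling does, the sketch below applies it to the five alcohol values printed in the summary earlier (minimum, quartiles and maximum). It also includes a guarded variant, `normalize_safe`, which is a hypothetical addition: a column whose values are all equal would otherwise produce a division by zero.

```r
# Min-max scaling as used in this section.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Alcohol min, quartiles and max, copied from the summary earlier in the document.
alcohol_points <- c(8.00, 9.50, 10.40, 11.40, 14.20)
scaled <- normalize(alcohol_points)
print(round(scaled, 3))

# Hypothetical guarded variant: return zeros for a constant column
# instead of NaN from dividing by a zero range.
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}
print(normalize_safe(c(2, 2, 2)))
```

None of the eleven features in this dataset is constant, so the guard is not needed here; it only matters if the same helper is reused on other data.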


Multivariate Analysis: Results
Final Plots and Summary
Plot One : Factors with less/No impact on quality

Description:
This plot contains all the important variables that have some correlation with the quality of the wine. It can be seen that sulfur dioxide, density and chlorides fall with increasing quality, whereas alcohol and citric acid rise. Variables like volatile and fixed acidity, free SO2 and sulphates have minimal correlation with the quality of the wine. Sulphate content has the least correlation of all, and it is almost the same across quality levels.
Plot Two : Alcohol vs Quality at different densities

Description:
White wine with a higher alcohol content has a higher quality index. Taken together with density, this shows that higher quality wine has a high alcohol content and a lower overall density.
Plot Three : Chlorides vs Alcohol at different qualities

Description:
We see that the level of chlorides decreases as the alcohol level by volume rises. This is prominent in medium to high quality wines. For low quality wines, we notice a rise in chloride levels beyond the 12% alcohol mark. Chloride levels are nearly the same across all wine qualities in the 10%-12% alcohol band. However, below 10% and above 12%, lower quality wines have higher levels of chlorides. This may be a case of hidden factors, though. M. S. Coli et al. show that chloride levels in red wines are influenced by terroir and grape type. Since these are not provided in the dataset, it could very well be that the scorers prefer wine from certain areas of the world.
Reflection
Based on the EDA and further analysis that I did for this dataset, I am convinced that alcohol percentage is the most important factor in deciding the quality of white wine. One important factor that contributes to the alcohol percentage is the sugar remaining in the wine after fermentation: the more sugar is left after fermentation, the lower the percentage of alcohol in the wine.
Other important factors in deciding the quality of a white wine are SO2 and chlorides. Both have a negative effect on the quality of white wines.
I used correlation values to find relationships between attributes. Where significant skew was present, I took logs. I found bimodality in a feature, but it appeared that it was not related to quality; it is likely that some confounders are causing the bimodality, and this is worth investigating further.
With all the quality levels, the plots started looking messy. To make things clearer, I re-categorised wine quality into three categories. This helped quite a lot. I found interesting trends in the relationships between wine quality, alcohol levels, and chloride levels.
In future I could develop a predictive model to predict a wine sample's quality from the given set of features. Building such a model might also suggest new questions to ask of the data. Many more relationships and correlations can be extracted from this dataset; in this project we investigated only the most obvious relationships between wine quality and its properties.